Abstract


Parser combinators are a popular and elegant approach for 
parsing in functional languages. The design and implemen- 
tation of such libraries are well discussed, but having a well- 
designed library is only one-half of the story. In this paper 
we explore several reusable approaches to writing parsers in 
combinator style, focusing on easy-to-apply patterns to keep
parsing code simple, separated, and maintainable. 
1 Introduction


Design patterns, popularised by Gamma et al. [9], are software
design principles that are not rigid rules to be adhered to,
but rather a guide for solving common problems and
structuring large bodies of code. Their most prolific use is
within the Object-Oriented Programming (OOP) community;
within the Functional Programming (FP) community, many 
patterns are simply implemented using higher-order func- 
tions. In fact, one example of the strength of higher-order 
functions is the development of combinator libraries, and in 
particular parser combinators [7, 14, 15, 19, 27, 29, 31, 32]. But 
while FP provides many of these beautiful abstractions, not
enough is said about how to actually use them in a maintain- 
able and scalable way. Indeed, this paper aims to highlight 
several parser combinator design patterns; these patterns,
certainly not exhaustive, should: 


• Structure and organise larger parsers
• Separate the various concerns of different parts of parsers
• Keep the intention and shape of the grammar clear
• Create informative error messages
• Guide the implementation with strong types


As mentioned, parser combinators are an elegant functional
approach to parsing grammars, including context-sensitive
ones, embraced by many in the Haskell community.
Compared to parser generator tools, like Haskell’s
Happy, parsers developed with combinators are written in 
pure Haskell as a Domain-Specific Language. This paper 
assumes some knowledge of parser combinators and the 
tutorial by Swierstra [28] serves as a good introduction. 

The most ubiquitous family of combinators in Haskell is 
the parsec [19] family: consisting of the libraries parsec, 
attoparsec, and megaparsec. This family is primarily char- 
acterised by their shared semantics for backtracking, where 
alternative parts of a grammar may only be taken when in- 
put has not been consumed in another; in particular they 
all leverage the try combinator to opt-in to backtracking, 
as opposed to a cut combinator to opt-out. Practically, this 
means there are some considerations when implementing 
some of our patterns within this family, concerning try, that 
do not occur in other libraries. 

The presentation of our patterns will be anchored around 
a main running example developed in a generic and simple 
implementation of a parsec family library in Haskell. The 
aim is to provide a solid foundation without having to focus 
on the specifics of any particular API or their individual
quirks. Importantly, our patterns are not only useful for 
parsec or necessarily Haskell and apply generally. 


1.1 An Introductory Example 


The classic example used to demonstrate how to use parser 
combinators is some variant of an expression language, the 
grammar for which is shown in Figure 1. To start with, this 
language supports: the standard arithmetic operators, each 
of which is left-associative, denoted by the left recursion; 
numbers; variables; parenthesised expressions; and a prefix
negation operator negate, modelled by the following AST:
⟨digit⟩  ::= '0' .. '9'
⟨number⟩ ::= ⟨digit⟩+
⟨ident⟩  ::= ⟨alpha⟩ ⟨alpha-num⟩*
⟨expr⟩   ::= ⟨expr⟩ '+' ⟨term⟩
           | ⟨expr⟩ '-' ⟨term⟩
           | ⟨term⟩
⟨term⟩   ::= ⟨term⟩ '*' ⟨negate⟩ | ⟨negate⟩
⟨negate⟩ ::= 'negate' ⟨negate⟩ | ⟨atom⟩
⟨atom⟩   ::= '(' ⟨expr⟩ ')'
           | ⟨number⟩ | ⟨ident⟩


Figure 1. Grammar 


data Expr = Num Int | Var String 
| Neg Expr | Mul Expr Expr 
| Add Expr Expr | Sub Expr Expr 


The parser will directly construct this datatype: combinator 
libraries can incorporate semantic actions directly into the 
parser itself. For now, the AST is monolithic and homogeneous,
but ideally, it would itself mirror the grammar. The
corresponding parser for the grammar is given in Figure 2.
This parser maps closely to the original grammar: rule
alternatives are expressed with the <|> “or” combinator, and
semantic actions and sequencing are applied with the standard
Applicative <$> and <*> combinators, pronounced “fmap”
and “ap” respectively. The Kleene star and plus operations are
implemented with Alternative many and some respectively.


The Combinators. For reference, here is a summary of 
the combinators used in this paper, along with their types: 


(<$>)  :: Functor f => (a -> b) -> f a -> f b
(<$)   :: Functor f => a -> f b -> f a
pure   :: Applicative f => a -> f a
(<*>)  :: Applicative f => f (a -> b) -> f a -> f b
(<*)   :: Applicative f => f a -> f b -> f a
(*>)   :: Applicative f => f a -> f b -> f b
(<**>) :: Applicative f => f a -> f (a -> b) -> f b
(<:>)  :: Applicative f => f a -> f [a] -> f [a]


This first set of combinators is responsible for sequencing
actions (in this case parsing actions) and combining their 
results in a way that follows the combinator’s type signature: 
<*>, for instance, will apply the function returned by the first 
action to the value returned by the second. Notably, pure 
does nothing in a parsing sense except return a result. 


(<|>)  :: Alternative f => f a -> f a -> f a
many   :: Alternative f => f a -> f [a]
some   :: Alternative f => f a -> f [a]
choice :: Alternative f => [f a] -> f a
digit  = oneOf ['0'..'9']
number = foldl addDigit 0 <$> some digit
ident  = alpha <:> many alphaNum
expr   = Add <$> expr <*> (char '+' *> term)
     <|> Sub <$> expr <*> (char '-' *> term)
     <|> term
term   = Mul <$> term <*> (char '*' *> negate) <|> negate
negate = Neg <$> (string "negate" *> negate) <|> atom
atom   = char '(' *> expr <* char ')'
     <|> Num <$> number <|> Var <$> ident
addDigit n d = n * 10 + digitToInt d


Figure 2. Parser 


The next set of combinators is responsible for choice and
data-independent branching: <|> will try parsing its first
argument and, if it fails, the second argument is tried (in some
libraries, only if the first consumed no input). Both many
and some are built on top of this and <:>, to try performing
an action multiple times until it fails, collecting the results
into a list, with many requiring zero or more successes, and
some requiring one or more. The choice combinator tries
each action in the list in turn, until one succeeds, using <|>.

The canonical form for terms using these operators is
sequencing operations separated by alternative operations:
usually achieved by making the <|> combinator infixl 3 and
the rest infixl 4, so that <|> binds weaker than the others.


char   :: Char -> Parser Char
string :: String -> Parser String
oneOf  :: [Char] -> Parser Char
try    :: Parser a -> Parser a


The final group of combinators are specific to parsers: char 
parses a specific character; string parses a specific string of 
characters one after the other; and oneOf is a character class, 
using choice to parse any of the provided characters. The try 
combinator in the parsec family undoes input consumption 
on failure, allowing <|> to take its second branch.
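
To make the interaction between the choice combinator and try concrete, here is a minimal sketch of a parsec-style parser. It is not the implementation used in this paper or by any particular library: the Parser representation, its Bool "consumed" flag, the parse helper, and the names withoutTry and withTry are assumptions made purely for illustration.

```haskell
import Control.Applicative (Alternative (..))

-- Assumed representation: (consumed any input?, result if successful).
newtype Parser a = Parser
  { runParser :: String -> (Bool, Maybe (a, String)) }

instance Functor Parser where
  fmap f (Parser p) = Parser $ \s ->
    let (c, r) = p s in (c, fmap (\(x, rest) -> (f x, rest)) r)

instance Applicative Parser where
  pure x = Parser $ \s -> (False, Just (x, s))
  Parser pf <*> Parser px = Parser $ \s -> case pf s of
    (c, Nothing)        -> (c, Nothing)
    (c, Just (f, rest)) ->
      let (c', r) = px rest
      in (c || c', fmap (\(x, rest') -> (f x, rest')) r)

instance Alternative Parser where
  empty = Parser $ \_ -> (False, Nothing)
  -- The second branch is tried only if the first failed
  -- without consuming any input.
  Parser p <|> Parser q = Parser $ \s -> case p s of
    (False, Nothing) -> q s
    r                -> r

char :: Char -> Parser Char
char c = Parser $ \s -> case s of
  (x : xs) | x == c -> (True, Just (c, xs))
  _                 -> (False, Nothing)

string :: String -> Parser String
string = traverse char

-- try: on failure, report that no input was consumed.
try :: Parser a -> Parser a
try (Parser p) = Parser $ \s -> case p s of
  (_, Nothing) -> (False, Nothing)
  r            -> r

-- Hypothetical helper that discards the flag and leftover input.
parse :: Parser a -> String -> Maybe a
parse p = fmap fst . snd . runParser p

withoutTry, withTry :: Maybe String
withoutTry = parse (string "let" <|> string "lambda") "lambda"
withTry    = parse (try (string "let") <|> string "lambda") "lambda"
```

In this sketch withoutTry fails, because string "let" has already consumed the 'l' by the time it fails, whereas withTry succeeds with "lambda".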


1.2 The Common Problems with Combinators 


Whilst the example parser perfectly maps to the grammar for
the language, it in fact exhibits several common problems:


Left-Recursive Expressions. The biggest problem with 
this parser is that it is left-recursive. For many parser combi- 
nator libraries, this will cause infinite recursion at runtime 
since the recursion is unguarded by input consumption. 

Instead of using traditional grammar transformations on 
both the grammar and the parser to left-factor [1, 21] it, our 
first pattern addresses it idiomatically, whilst introducing 
extra type safety (Section 2). 


Design Patterns for Parser Combinators 


In-Place Lexing. The example parser does not make any 
attempt to parse whitespace, and there are some counter- 
intuitive parses possible from poor lexing. 

To combat this, whitespace handling and general prac- 
tices of lexing are discussed (Section 3), and measures are 
introduced to abstract them, further refining the design. 


Bookkeeping Information. The design does not lend itself
well to changing requirements: suppose position information
must be added to the AST; the parser would be altered
in a way that obscures its underlying purpose and structure.

As a result, further separation is introduced between the
semantic action of the parser (in this case AST construction)
and the parsing logic that represents the grammar (Section 4).
The result of this pattern will be code that is robust to
changes in AST requirements and more effectively separates
the concerns of the code. This will be illustrated by extending
the grammar to handle assignments and statements.


Helpful Errors. To show how to improve errors, the gram- 
mar is extended again with conditional statements, and both 
positive and negative lookahead will be leveraged to make 
bespoke error messages for the user (Section 5). 


Related work is discussed (Section 6) and the parser’s 
development is summarised and reflected on at the very end 
of the journey, with some closing remarks (Section 7). 


2 Expression Parsing 


Expression parsers have a very standardised shape in a gram- 
mar. Consider the following grammar: 


⟨pred⟩ ::= ⟨comp⟩ '&&' ⟨pred⟩ | ⟨comp⟩
⟨comp⟩ ::= ⟨expr⟩ '<' ⟨expr⟩ | ⟨expr⟩ '=' ⟨expr⟩ | ⟨expr⟩
⟨expr⟩ ::= ⟨expr⟩ '+' ⟨term⟩ | ⟨expr⟩ '-' ⟨term⟩ | ⟨term⟩
⟨term⟩ ::= ⟨term⟩ '*' ⟨atom⟩ | ⟨atom⟩
⟨atom⟩ ::= '(' ⟨expr⟩ ')' | ⟨number⟩ | ⟨ident⟩


This consists of various rules each referring to the next,
denoting the precedence of each operator (⟨term⟩ is tighter
than ⟨expr⟩). When two operators appear in the same rule
they share the same precedence (like < and =; or + and -).
Usually, the location of recursion (or its absence) denotes
the associativity (or lack thereof) of the operator. Recursion
on the left is left-associative (as in ⟨expr⟩), on the right is
right-associative (as in ⟨pred⟩), and no self-recursion is
non-associative (as in ⟨comp⟩). Most parser combinator libraries
operate as recursive descent, including the parsec family:
this means that left recursion results in non-termination.


Problem 1: Left-Recursive Expressions. Expressions 
with left recursion cannot be encoded by recursive descent 
parsers and will diverge. 


One solution to handling left recursion in a grammar is
to change the grammar using left-factoring; however, it is
preferable to leave the grammar alone: the grammar should
not have to be tailored to the implementation.
Anti-pattern 1: Grammar Refactoring. Modifying the
grammar to remove left recursion exposes implementation 
details and complicates the grammar. 


2.1 The Homogeneous Chain Combinators 


Hutton and Meijer [15] discuss the classic technique of re- 
placing left recursion with iteration or recursion with an ac- 
cumulating parameter. The resulting idiomatic combinators 
are known traditionally as the chain combinators. Conven- 
tionally, parser combinator libraries usually define two sorts 
of chains: chainl1 and chainr1. 


chainl1 :: Parser a -> Parser (a -> a -> a) -> Parser a
chainr1 :: Parser a -> Parser (a -> a -> a) -> Parser a


The chainx1 p op combinator should parse one or more 
ps, separated by ops, applied x-associatively. 


Pattern 1a: Homogeneous Chains. For binary operators 
where the associativity is not specified, use chainl1 or 
chainr1 to combine operands with their operators. 


expr = chainl1 term (Add <$ char '+' <|> Sub <$ char '-')
term = chainl1 negate (Mul <$ char '*') 


This parser, in conjunction with the original definitions for 
negate, atom, number, and ident now works correctly and 
also does not backtrack. However, the use of the Homoge- 
neous Chains pattern implies that the grammar does not 
specify the associativity like, for example, the following: 


⟨func⟩ ::= ⟨func⟩ '.' ⟨func⟩ | ⟨lambda⟩


The recursion on both sides of the '.' operator in the
⟨func⟩ rule means that it is associative, without specifying
whether it is to the left or the right. This makes the rule a
great candidate for the Homogeneous Chains pattern, since
the concrete associativity is left as an implementation detail:
because chainl1 and chainr1 share the same type, changing the
associativity would be seamless. With our example, however,
the grammar does specify the exact associativity of the
operators, so the ease of exchanging one chain for the other can
allow the parser to be unfaithful to the grammar.
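
The risk of silently swapping one chain for the other can be illustrated with a small sketch. It assumes a deliberately tiny backtracking Parser (a stand-in, not the parsec semantics or any library's API), and the names minus, leftAssoc and rightAssoc are hypothetical. The chains are written in the style of the classic definitions and evaluate directly to an Int rather than building an AST.

```haskell
import Control.Applicative (Alternative (..), (<**>))
import Data.Char (digitToInt, isDigit)

-- Assumption: a minimal backtracking parser, for illustration only.
newtype Parser a = Parser { runParser :: String -> Maybe (a, String) }

instance Functor Parser where
  fmap f (Parser p) = Parser $ \s -> fmap (\(x, r) -> (f x, r)) (p s)

instance Applicative Parser where
  pure x = Parser $ \s -> Just (x, s)
  Parser pf <*> Parser px = Parser $ \s -> case pf s of
    Nothing     -> Nothing
    Just (f, r) -> fmap (\(x, r') -> (f x, r')) (px r)

instance Alternative Parser where
  empty = Parser (const Nothing)
  Parser p <|> Parser q = Parser $ \s -> maybe (q s) Just (p s)

satisfy :: (Char -> Bool) -> Parser Char
satisfy f = Parser $ \s -> case s of
  (c : cs) | f c -> Just (c, cs)
  _              -> Nothing

-- Left-associative chain: iterate, composing the pending operations.
chainl1 :: Parser a -> Parser (a -> a -> a) -> Parser a
chainl1 p op = p <**> rest
  where rest = flip (.) <$> (flip <$> op <*> p) <*> rest <|> pure id

-- Right-associative chain: recurse on the right operand.
chainr1 :: Parser a -> Parser (a -> a -> a) -> Parser a
chainr1 p op = p <**> (flip <$> op <*> chainr1 p op <|> pure id)

number :: Parser Int
number = foldl (\n d -> n * 10 + digitToInt d) 0 <$> some (satisfy isDigit)

minus :: Parser (Int -> Int -> Int)
minus = (-) <$ satisfy (== '-')

-- Same arguments, same types, different meanings.
leftAssoc, rightAssoc :: Maybe (Int, String)
leftAssoc  = runParser (chainl1 number minus) "8-3-2"  -- (8 - 3) - 2
rightAssoc = runParser (chainr1 number minus) "8-3-2"  -- 8 - (3 - 2)
```

Both definitions typecheck with identical arguments, yet leftAssoc yields 3 and rightAssoc yields 7, so the type checker offers no help if the wrong chain is chosen.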


2.2 The Heterogeneous Chain Combinators 

The problem with using the conventional chains introduced
in Section 2.1 for our grammar is that they cannot be
distinguished from each other, except lexically. It is very easy
to accidentally use the wrong one, silently changing the meaning
of the parser and making it unfaithful. Instead, we offer two
new chains with refined types along with their definitions:


infixl1 :: (a -> b) -> Parser a
        -> Parser (b -> a -> b) -> Parser b
infixl1 wrap p op = (wrap <$> p) <**> rest
  where rest = flip (.) <$> (flip <$> op <*> p) <*> rest
           <|> pure id

infixr1 :: (a -> b) -> Parser a
        -> Parser (a -> b -> b) -> Parser b
infixr1 wrap p op =
  p <**> (flip <$> op <*> infixr1 wrap p op <|> pure wrap)


The types of the chains in this formulation make it much
clearer which is which: the operators produce bs more tightly 
in the correct position. This comes at an ergonomic cost, 
since a wrapping function that transforms the terminal item 
in the chain into the correct type must be provided. Like 
chainl1 and chainr1, their definitions generalise the shape of 
left- and right-factored parsers, providing a recipe for how 
to transform the grammars by hand, though a strength of 
combinators is allowing these higher-order parsing recipes 
to be defined and used instead. These new chains are related 
to the classic versions: 

chainl1 = infixl1 id
chainr1 = infixr1 id


Additionally, postfix and prefix serve as a natural extension: 


postfix :: (a -> b) -> Parser a
        -> Parser (b -> b) -> Parser b
postfix wrap p op = (wrap <$> p) <**> rest
  where rest = flip (.) <$> op <*> rest <|> pure id

prefix :: (a -> b) -> Parser (b -> b)
       -> Parser a -> Parser b
prefix wrap op p = op <*> prefix wrap op p <|> wrap <$> p

These chain combinators handle many applications of postfix
operators or prefix operators to a terminal item. In particular,
the definition of postfix is very close in shape to infixl1's;
indeed, infixl1 can be easily given in terms of postfix:


infixl1 wrap p op = postfix wrap p (flip <$> op <*> p) 


Pattern 1b: Heterogeneous Chains. For associative oper- 
ators where operand types may differ, use infixl1 or infixr1 
to combine operands with their operators, in conjunction 
with strongly typed semantic actions. 


To properly leverage the additional type-safety provided 
by the new heterogeneous chains, the AST itself must change:


data Expr   = Add Expr Term | Sub Expr Term
            | OfTerm Term
data Term   = Mul Term Negate | OfNegate Negate
data Negate = Neg Negate | OfAtom Atom
data Atom   = Num Int | Var String | Parens Expr


This datatype more accurately describes the shape of the 
grammar; this is good because it provides a second layer 
with which to check that the parser is correct. As an added 
benefit, functions that consume this datatype can rely on the 
shape of the AST being left- or right-associated. The parser
now has to be adapted to the new type: 
expr = infixl1 OfTerm term
         (Add <$ char '+' <|> Sub <$ char '-')
term = infixl1 OfNegate negate (Mul <$ char '*')
negate = prefix OfAtom (Neg <$ string "negate") atom
atom = char '(' *> (Parens <$> expr) <* char ')'
   <|> Num <$> number <|> Var <$> ident


This new parser conforms to the new datatype and, as such, 
if the programmer accidentally switches an infixl1 for an 
infixr1, this parser would fail to typecheck. Unfortunately, 
the same level of guarantee is not present for prefix and 
postfix, other than their reversed arguments. Unlike the pre- 
vious, homogeneous, version of the parser, the “recursive” 
point in atom has to be wrapped in the Parens constructor. 
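
As a rough illustration of how the refined types guide the implementation, the sketch below runs infixl1 over a cut-down, two-layer version of the AST. The tiny backtracking Parser, the single-digit term parser, and the simplified Term type are assumptions for the sake of a self-contained example; only infixl1 follows the definition given above.

```haskell
import Control.Applicative (Alternative (..), (<**>))
import Data.Char (digitToInt, isDigit)

-- Assumption: a minimal backtracking parser, for illustration only.
newtype Parser a = Parser { runParser :: String -> Maybe (a, String) }

instance Functor Parser where
  fmap f (Parser p) = Parser $ \s -> fmap (\(x, r) -> (f x, r)) (p s)

instance Applicative Parser where
  pure x = Parser $ \s -> Just (x, s)
  Parser pf <*> Parser px = Parser $ \s -> case pf s of
    Nothing     -> Nothing
    Just (f, r) -> fmap (\(x, r') -> (f x, r')) (px r)

instance Alternative Parser where
  empty = Parser (const Nothing)
  Parser p <|> Parser q = Parser $ \s -> maybe (q s) Just (p s)

satisfy :: (Char -> Bool) -> Parser Char
satisfy f = Parser $ \s -> case s of
  (c : cs) | f c -> Just (c, cs)
  _              -> Nothing

char :: Char -> Parser Char
char = satisfy . (==)

infixl1 :: (a -> b) -> Parser a -> Parser (b -> a -> b) -> Parser b
infixl1 wrap p op = (wrap <$> p) <**> rest
  where rest = flip (.) <$> (flip <$> op <*> p) <*> rest <|> pure id

-- A cut-down, two-layer AST: the left operand of Add and Sub is an Expr,
-- the right operand a Term, so only left-nested trees are representable.
data Expr = Add Expr Term | Sub Expr Term | OfTerm Term
  deriving (Eq, Show)
newtype Term = Num Int
  deriving (Eq, Show)

-- Simplification: a term is a single digit.
term :: Parser Term
term = Num . digitToInt <$> satisfy isDigit

-- Add :: Expr -> Term -> Expr fits the (b -> a -> b) shape that infixl1
-- demands; a right-associative chain would need (a -> b -> b) instead,
-- i.e. Term -> Expr -> Expr, so swapping the chain would not typecheck.
expr :: Parser Expr
expr = infixl1 OfTerm term (Add <$ char '+' <|> Sub <$ char '-')
```

Running runParser expr "1+2-3" in this sketch produces the left-nested tree Sub (Add (OfTerm (Num 1)) (Num 2)) (Num 3) with no remaining input.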


2.3 Generalising to Precedence Tables 


Section 2.1 introduced chainl1 as a way of managing left re- 
cursion in a grammar, refactored the parser to eliminate left 
recursion, and introduced heterogeneous chains to support 
datatypes that encode the grammar more precisely. This is 
a good first step but there is still a lot of mechanical busy- 
work to encode the precedence and associativity of opera- 
tors, not to mention the daunting prospect of running out 
of four-letter parser names as the grammar expands! Parser 
generator tools often expose a more concise and scalable 
representation of expression parsers: this same experience 
can be developed for parser combinators. 

A precedence combinator accepts a table of operator prece- 
dence along with a base “atom”. This is a combinator that 
is found, in some shape or form, in most parser combinator 
libraries — including the parsec family — as well as in the 
literature [3, 12, 13]. This table is normally implemented as 
a list of some Op datatype that expresses one or more opera- 
tors of some associativity and fixity. However, [Op a] would 
imply a homogeneous table, and this is not desirable for the 
heterogeneous parser in Section 2.2. A heterogeneous list 
will fit better: 


data Fixity a b sig where
  InfixL  :: Fixity a b (b -> a -> b)
  InfixR  :: Fixity a b (a -> b -> b)
  InfixN  :: Fixity a b (a -> a -> b)
  Prefix  :: Fixity a b (b -> b)
  Postfix :: Fixity a b (b -> b)

data Op a b where
  Op :: Fixity a b sig -> (a -> b) -> Parser sig -> Op a b

data Prec a where
  Level :: Prec a -> Op a b -> Prec b
  Atom  :: Parser a -> Prec a

infixl 5 >+
(>+) = Level
infixr 5 +<
(+<) = flip (>+)




The Fixity datatype relates the input a with the output b of
an operator, given by the type sig. The InfixN constructor
represents non-associative operators, which can appear at
most once¹. The Fixity datatype is useful since it frees
any functions built on Op from having to consider every
specialised fixity. The Op datatype is a defunctionalised
representation of a heterogeneous chain, partially applied to
the operator but not the atom. The list-like Prec structure
combines smaller precedence tables with a new layer,
connected by the Op. To make this more ergonomic, the (>+)
and (+<) operators allow the table to be built from strongest
to weakest or weakest to strongest². A precedence
combinator is a “fold” over a table:


precedence :: Prec a -> Parser a
precedence (Atom atom) = atom
precedence (Level lvls ops) = con (precedence lvls) ops
  where con :: Parser a -> Op a b -> Parser b
        con p (Op InfixL wrap op) = infixl1 wrap p op
        con p (Op InfixR wrap op) = infixr1 wrap p op
        con p (Op InfixN wrap op) =
          p <**> (flip <$> op <*> p <|> pure wrap)
        con p (Op Prefix wrap op) = prefix wrap op p
        con p (Op Postfix wrap op) = postfix wrap p op

precHomo :: Parser a -> [Op a a] -> Parser a
precHomo atom = precedence . foldl (>+) (Atom atom)


The idea is to traverse the table from the deepest layer out- 
wards, converting each operator in turn into the correspond- 
ing chain. The homogeneous precedence parser precHomo 
can easily be recovered as a fold over a regular list. 

Using helper functions can make the Op datatype easier 
to use by providing common wrapping functions: these are 
id and functions that implement a sort of sub-typing. 


gops :: Fixity a b sig -> (a -> b) -> [Parser sig] -> Op a b
gops fixity wrap = Op fixity wrap . choice

ops :: Fixity a a sig -> [Parser sig] -> Op a a
ops fixity = gops fixity id

class sub < sup where
  upcast   :: sub -> sup
  downcast :: sup -> Maybe sub

sops :: a < b => Fixity a b sig -> [Parser sig] -> Op a b
sops fixity = gops fixity upcast


The gops function takes many operators at the same level
and combines them into one with choice. The ops function
supports homogeneous operators with id. The sops function
uses upcasting to perform wrapping: in functional-OOP
languages, datatype hierarchies are often made using subtype
polymorphism to avoid explicit wrapper constructors. This
is mimicked by a < b: a function is designated as the cast.

¹ Notice that b appears in neither the operator's left nor right positions.
² The operators are eating the levels with the higher precedence.


Pattern 1c: Precedence Tables. For expressions, use the 
precedence combinator to deal with both fixity and prece- 
dence concisely. 


By giving each layer of the AST its own (<) instance,
using the OfX constructors, a simpler and more concise
definition of expr can be given using precedence and sops:


expr = precedence $
  sops InfixL [Add <$ char '+', Sub <$ char '-'] +<
  sops InfixL [Mul <$ char '*']                  +<
  sops Prefix [Neg <$ string "negate"]           +<
  Atom atom


By using sops, the wrapper constructors in the original 
parser are avoided as they are resolved implicitly. Like the 
version with heterogeneous chains, adding, removing, or
altering any layers will generate a type error³. Any alterations
to the parser need to be reflected in the datatype itself. 
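
The table-driven approach can be tried end to end with the following sketch. It makes several simplifying assumptions: the Parser is a tiny backtracking stand-in rather than a parsec-family library, the table is homogeneous (built with an ops helper defined via foldr instead of choice) and evaluates straight to an Int instead of building the layered AST, and prefix negation is spelled '~' to keep the lexing trivial.

```haskell
{-# LANGUAGE GADTs #-}
import Control.Applicative (Alternative (..), (<**>))
import Data.Char (digitToInt, isDigit)

-- Assumption: a minimal backtracking parser, for illustration only.
newtype Parser a = Parser { runParser :: String -> Maybe (a, String) }

instance Functor Parser where
  fmap f (Parser p) = Parser $ \s -> fmap (\(x, r) -> (f x, r)) (p s)
instance Applicative Parser where
  pure x = Parser $ \s -> Just (x, s)
  Parser pf <*> Parser px = Parser $ \s -> case pf s of
    Nothing     -> Nothing
    Just (f, r) -> fmap (\(x, r') -> (f x, r')) (px r)
instance Alternative Parser where
  empty = Parser (const Nothing)
  Parser p <|> Parser q = Parser $ \s -> maybe (q s) Just (p s)

satisfy :: (Char -> Bool) -> Parser Char
satisfy f = Parser $ \s -> case s of
  (c : cs) | f c -> Just (c, cs)
  _              -> Nothing

char :: Char -> Parser Char
char = satisfy . (==)

postfix :: (a -> b) -> Parser a -> Parser (b -> b) -> Parser b
postfix wrap p op = (wrap <$> p) <**> rest
  where rest = flip (.) <$> op <*> rest <|> pure id

prefix :: (a -> b) -> Parser (b -> b) -> Parser a -> Parser b
prefix wrap op p = op <*> prefix wrap op p <|> wrap <$> p

infixl1 :: (a -> b) -> Parser a -> Parser (b -> a -> b) -> Parser b
infixl1 wrap p op = postfix wrap p (flip <$> op <*> p)

infixr1 :: (a -> b) -> Parser a -> Parser (a -> b -> b) -> Parser b
infixr1 wrap p op = p <**> (flip <$> op <*> infixr1 wrap p op <|> pure wrap)

data Fixity a b sig where
  InfixL  :: Fixity a b (b -> a -> b)
  InfixR  :: Fixity a b (a -> b -> b)
  InfixN  :: Fixity a b (a -> a -> b)
  Prefix  :: Fixity a b (b -> b)
  Postfix :: Fixity a b (b -> b)

data Op a b where
  Op :: Fixity a b sig -> (a -> b) -> Parser sig -> Op a b

data Prec a where
  Level :: Prec a -> Op a b -> Prec b
  Atom  :: Parser a -> Prec a

infixr 5 +<
(+<) :: Op a b -> Prec a -> Prec b
(+<) = flip Level

precedence :: Prec a -> Parser a
precedence (Atom atom)      = atom
precedence (Level lvls ops') = con (precedence lvls) ops'
  where
    con :: Parser a -> Op a b -> Parser b
    con p (Op InfixL  wrap op) = infixl1 wrap p op
    con p (Op InfixR  wrap op) = infixr1 wrap p op
    con p (Op InfixN  wrap op) = p <**> (flip <$> op <*> p <|> pure wrap)
    con p (Op Prefix  wrap op) = prefix wrap op p
    con p (Op Postfix wrap op) = postfix wrap p op

-- Homogeneous helper: all operators of one level, wrapped with id.
ops :: Fixity a a sig -> [Parser sig] -> Op a a
ops fixity = Op fixity id . foldr (<|>) empty

number :: Parser Int
number = foldl (\n d -> n * 10 + digitToInt d) 0 <$> some (satisfy isDigit)

-- Weakest level first; '~' is a stand-in for the negate keyword.
expr :: Parser Int
expr = precedence $
  ops InfixL [(+) <$ char '+', (-) <$ char '-'] +<
  ops InfixL [(*) <$ char '*']                  +<
  ops Prefix [negate <$ char '~']               +<
  Atom atom

atom :: Parser Int
atom = char '(' *> expr <* char ')' <|> number
```

With these definitions, runParser expr "2+3*4" gives Just (14, ""), "8-3-2" gives Just (3, ""), and "~(1+2)*3" gives Just (-9, ""), reflecting the precedence and associativity encoded by the order and fixity of the table rows.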


2.4 Aside: Folds for Parsers 


The left chain (Section 2.1) is a conversion from recursion to 
iteration or, in this case, recursion using an accumulating pa- 
rameter (realised here by composing functions). The relation 
between iteration with fold and recursion with accumulation 
is an instance of deforestation [10, 30]. The idea is to fuse the 
building and the consumption of an intermediate structure 
— in this case lists — to eliminate it. In fact, in the existing 
parser, there is an example of a structure that is built up and 
immediately crushed back down: 


number = foldl addDigit 0 <$> some digit


The combinator some returns a list of results, but this list is 
then folded immediately: this is wasteful. Happily, the chains 
are useful for trimming the forest, in particular postfix, prefix, 
and infixl1 can be used to create so-called parser folds. 


manyr :: (a -> b -> b) -> b -> Parser a -> Parser b
manyr f k p = prefix id (f <$> p) (pure k)

manyl :: (b -> a -> b) -> b -> Parser a -> Parser b
manyl f k p = postfix id (pure k) (flip f <$> p)

somer :: (a -> b -> b) -> b -> Parser a -> Parser b
somer f k p = f <$> p <*> manyr f k p

somel :: (b -> a -> b) -> b -> Parser a -> Parser b
somel f k p = infixl1 (f k) p (pure f)


³ In fact, this mechanism successfully caught a typo in this example parser!

Just as foldr (:) [] is the identity fold, manyr (:) [] = many
and somer (:) [] = some. Here, the heterogeneous infixl1
can be used with the wrapping function f k representing
the initial application of the accumulator to the first item in
the fold: this would not be possible with the homogeneous
chainl1. The relation between a deforested parser fold and
the “forested” fold with iterative combinator is as follows:


manyr f k p = foldr f k <$> many p
manyl f k p = foldl f k <$> many p
somer f k p = foldr f k <$> some p
somel f k p = foldl f k <$> some p


With this in mind, the definition of number can be simplified: 
number = somel addDigit 0 digit 


While there is little aesthetic difference between the old 
version and the new one, the second is more efficient as it 
does not have to build a list, instead consuming its elements 
in situ. Really, it serves to highlight the flexibility of chains 
and their applicability in a variety of different scenarios. 
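
The equivalence between the forested and deforested forms can be checked on a small input. The sketch below again assumes a tiny backtracking Parser as a stand-in for a real library; the names forested and deforested are hypothetical, while postfix, infixl1, manyl and somel follow the definitions given above.

```haskell
import Control.Applicative (Alternative (..), (<**>))
import Data.Char (digitToInt, isDigit)

-- Assumption: a minimal backtracking parser, for illustration only.
newtype Parser a = Parser { runParser :: String -> Maybe (a, String) }

instance Functor Parser where
  fmap f (Parser p) = Parser $ \s -> fmap (\(x, r) -> (f x, r)) (p s)

instance Applicative Parser where
  pure x = Parser $ \s -> Just (x, s)
  Parser pf <*> Parser px = Parser $ \s -> case pf s of
    Nothing     -> Nothing
    Just (f, r) -> fmap (\(x, r') -> (f x, r')) (px r)

instance Alternative Parser where
  empty = Parser (const Nothing)
  Parser p <|> Parser q = Parser $ \s -> maybe (q s) Just (p s)

satisfy :: (Char -> Bool) -> Parser Char
satisfy f = Parser $ \s -> case s of
  (c : cs) | f c -> Just (c, cs)
  _              -> Nothing

postfix :: (a -> b) -> Parser a -> Parser (b -> b) -> Parser b
postfix wrap p op = (wrap <$> p) <**> rest
  where rest = flip (.) <$> op <*> rest <|> pure id

infixl1 :: (a -> b) -> Parser a -> Parser (b -> a -> b) -> Parser b
infixl1 wrap p op = postfix wrap p (flip <$> op <*> p)

-- Left folds over zero-or-more and one-or-more results, with no list built.
manyl :: (b -> a -> b) -> b -> Parser a -> Parser b
manyl f k p = postfix id (pure k) (flip f <$> p)

somel :: (b -> a -> b) -> b -> Parser a -> Parser b
somel f k p = infixl1 (f k) p (pure f)

addDigit :: Int -> Char -> Int
addDigit n d = n * 10 + digitToInt d

digit :: Parser Char
digit = satisfy isDigit

-- "Forested": builds the list of digits, then folds it.
forested :: Parser Int
forested = foldl addDigit 0 <$> some digit

-- "Deforested": folds each digit into the accumulator as it is parsed.
deforested :: Parser Int
deforested = somel addDigit 0 digit

-- manyl can also count matches without materialising them.
countDigits :: Parser Int
countDigits = manyl (\n _ -> n + 1) 0 digit
```

On the input "2024!" both forested and deforested return Just (2024, "!"), and both fail on an input with no leading digit, since each requires at least one match.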


Discussion. The issues of left recursion and organising 
expression parsers can be cleanly eliminated with the help 
of precedence and chains. On the surface, it appears as if 
precedence is a clear win, however, as illustrated by Sec- 
tion 2.4, chains are more versatile than they might first ap- 
pear, and, arguably, precedence is overkill for only a single 
layer. This appears in practice from time to time: grammars 
where “;” is considered an operator, for instance. 


3 Effective Lexing 


When writing parsers with a parser generator tool, there is 
often a distinction between the lexical analysis and the pars- 
ing stages. This is often exemplified by having two distinct 
tools: alex and happy, for instance. With parser combina- 
tors, however, the distinction is much less clear but no less 
important to consider. Even though the tool used for lexing 
and parsing is the same, clean and well-separated code leads 
to more maintainable and readable parsers. 

The parser refined in the previous section still has some 
issues. Some easy ones to see are the following (where Left 
represents failure and Right represents success): 


parse expr "x + 7" = Right (Var "x")
parse (expr <* eof) "x + 7" = Left "(1, 2): unex[..]"
parse expr "negatex" = Right (Neg (Var "x"))


All three of these problems are caused by the absence of 
proper lexing parsers and whitespace handling: "negatex" 
should be treated as a single identifier, the parser is not 
greedy, and it cannot consume spaces. The naive solution 
would be to insert whitespace consuming logic everywhere 
it is necessary straight into the parser: this is very noisy. 


Problem 2: In-Place Lexing. Dealing with tokens while 
parsing is intrusive to the overall structure of the parser 
and introduces clutter. 


Traditionally, parsers are designed with two phases: lexing 
and then parsing. The idea is to build up a stream of tokens 
instead of working on raw characters. This allows reading 
whitespace and chunking the input to be abstracted away
from the grammar. Running a lexing pass upfront has the
disadvantage that tokens are generated greedily, without
any contextual information about where the token
may lie relative to the grammar: a '-' might represent a
subtraction or be part of an integer literal but the lexer cannot
know which, so the choice is often deferred to the parser
and unary negation typically replaces negative literals.


Anti-pattern 2: Lexing then Parsing. Preprocessing the 
input with a dedicated lexer has no contextual awareness 
with which to selectively construct tokens. 


3.1 Dealing with Whitespace 


The most pressing issue to fix with the parser is whitespace. 
This is not difficult, but there are a couple of considerations: 


1. Whitespace should be read uniformly 
2. Whitespace should be unintrusive to the rest of the parser 


In particular, (2) will be properly addressed in Section 3.3, 
however (1) can be addressed now. The meaning of uniform 
in this context is to establish a convention on exactly where 
whitespace is read. A common temptation for newcomers
writing parsers is to always consume whitespace around
any given token (or worse, between any two combinators);
this is not ideal since, if done “properly”, this technique
will double up on reading whitespace. This is fairly benign
as it is only wasteful computationally — the second attempt 
at reading whitespace will always consume nothing; but, for 
efficiency, there are two options: always consume leading 
whitespace, or always consume trailing whitespace (and ul- 
timately consume in the opposite direction exactly once at 
the end or beginning respectively). 

At first glance, it might appear as if both approaches are
equivalent to each other; however, that is not the case. In
fact, reading leading whitespace has a few flaws that trailing
whitespace does not. Firstly, it is inefficient since it leads
to excessive backtracking. Secondly, in the parsec family
this falls afoul of (2): taking the second branch of a <|> is
conditional on the first having not consumed any input, so
the try combinator must be used to perform backtracking.
Ultimately, this means that whitespace is already intrusive to
the rest of the parser, since try combinators must be inserted
on the first branch of every <|> where whitespace is consumed.
The third reason to avoid using the leading approach is that
it can make acquiring position information from the parser
more difficult, since the reported positions will be before the
whitespace in front of the relevant token.

As a result, it is much better to deal with trailing white- 
space and read whitespace once at the very beginning of 
the parser. If every token parser consumes whitespace after 
performing its duties, whitespace will be handled correctly. 
Pattern 2a: Whitespace Combinators. Build a lexeme
combinator dedicated to consuming trailing whitespace 
after a given lexeme. Build a fully combinator to consume 
initial whitespace and the end of input. 


For the expression grammar, actually parsing whitespace 
will be easy, since there are no comments or special white- 
space described in the language: 


whitespace :: Parser ()
whitespace = () <$ many (satisfy isSpace)

fully :: Parser a -> Parser a
fully p = whitespace *> p <* eof

lexeme :: Parser a -> Parser a
lexeme p = p <* whitespace


The whitespace parser is responsible for reading zero or
more space-like characters and comments⁴. The fully
combinator is used to treat the top-level parser as a greedy unit that
is required to reach the end of input, whilst also consuming 
the first chunk of whitespace. The lexeme combinator should 
be used to wrap up any token parsers to ensure they con- 
sume the trailing spaces. Ideally, these parsers should be kept 
in a module separately from the main grammar, to cleanly 
encapsulate them: after all, the token parsers exposed to the 
main parser should all deal with whitespace themselves. 
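
The convention of trailing whitespace can be exercised with a short sketch. As before, the backtracking Parser and the eof primitive defined here are assumptions standing in for a real library, and sumExpr is a hypothetical two-operand grammar chosen only to keep the example small; whitespace, fully and lexeme follow the definitions above.

```haskell
import Control.Applicative (Alternative (..))
import Data.Char (digitToInt, isDigit, isSpace)

-- Assumption: a minimal backtracking parser, for illustration only.
newtype Parser a = Parser { runParser :: String -> Maybe (a, String) }

instance Functor Parser where
  fmap f (Parser p) = Parser $ \s -> fmap (\(x, r) -> (f x, r)) (p s)

instance Applicative Parser where
  pure x = Parser $ \s -> Just (x, s)
  Parser pf <*> Parser px = Parser $ \s -> case pf s of
    Nothing     -> Nothing
    Just (f, r) -> fmap (\(x, r') -> (f x, r')) (px r)

instance Alternative Parser where
  empty = Parser (const Nothing)
  Parser p <|> Parser q = Parser $ \s -> maybe (q s) Just (p s)

satisfy :: (Char -> Bool) -> Parser Char
satisfy f = Parser $ \s -> case s of
  (c : cs) | f c -> Just (c, cs)
  _              -> Nothing

-- Succeeds only when no input remains.
eof :: Parser ()
eof = Parser $ \s -> if null s then Just ((), s) else Nothing

whitespace :: Parser ()
whitespace = () <$ many (satisfy isSpace)

-- Leading whitespace is read exactly once, at the very start.
fully :: Parser a -> Parser a
fully p = whitespace *> p <* eof

-- Every token reads the whitespace that follows it.
lexeme :: Parser a -> Parser a
lexeme p = p <* whitespace

number :: Parser Int
number = lexeme (foldl (\n d -> n * 10 + digitToInt d) 0 <$> some (satisfy isDigit))

plus :: Parser Char
plus = lexeme (satisfy (== '+'))

-- Hypothetical grammar: exactly two numbers separated by '+'.
sumExpr :: Parser Int
sumExpr = (+) <$> number <*> (plus *> number)
```

Here runParser (fully sumExpr) "  12 +  30 " succeeds with Just (42, ""), while "12 + 30 x" is rejected because fully insists on reaching the end of input.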


3.2 Tokens 


When working with a lexer where tokens are generated on
demand, lexing can be context-aware, meaning that certain
tokens are only demanded within certain grammar rules.
The disadvantage of this approach is that it may
require several lexes of a single token when the parser
backtracks. Ideally, however, parsers should be constructed to
minimise backtracking, so the favoured approach here will
be to combine lexing and parsing into a single pass.


Tokens and Atomicity. The key consideration when cre- 
ating lexing parsers with parser combinators is that they 
should be atomic so that if reading a token fails it consumes 
no input at all: it is either all or nothing. This allows the 
parser to try matching a similar token without the grammar 
needing to worry about performing any backtracking. In 
practice, backtracking is still performed, but it should be 
performed immediately as opposed to waiting until an al- 
ternative branch is taken. For libraries with some form of 
cut operation, this means cuts can often be inserted after 
tokens. In the parsec family, however, backtracking of this 
kind is done with the try combinator. Thankfully, try is very 
cheap when the parser succeeds, so the lexing parsers will 
use it liberally: it does not even matter if the token is already 
atomic (which is the case for single-character tokens). With 
that in mind here is the first relevant combinator: 


token :: Parser a -> Parser a
token = lexeme . try

⁴ Learning from Section 2.4, manyl const () (satisfy isSpace) is better.


This combinator can be used to make a given parser into a 
token, which is to say that it consumes any trailing white- 
space and is completely atomic. In the parsec family, it is 
important to not put the whitespace parsing within the scope 
of the try: this would mean that if there were a syntax error 
caused by the whitespace itself, it would needlessly back- 
track for a while before the parser finally dies. This could be 
the case, for instance, if the language contained comments. 
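The effect of atomicity can be sketched with Parsec, which ships with GHC. This is an illustrative sketch rather than the paper's own library: `lexeme` and `fully` are assumed to follow Section 3.1, and the two tokens sharing the prefix "ne" are hypothetical.

```haskell
import Text.Parsec (eof, parse, spaces, string, try, (<|>))
import Text.Parsec.String (Parser)

-- Consume the whitespace that trails a parser (the Section 3.1 convention).
lexeme :: Parser a -> Parser a
lexeme p = p <* spaces

-- An atomic token: `try` wraps the token only, never the whitespace.
token :: Parser a -> Parser a
token = lexeme . try

-- Skip leading whitespace once and insist on reaching the end of input.
fully :: Parser a -> Parser a
fully p = spaces *> p <* eof

-- Two hypothetical tokens sharing the prefix "ne".
atomic, leaky :: Parser String
atomic = token (string "negate") <|> token (string "never")
leaky  = lexeme (string "negate") <|> lexeme (string "never")

-- Collapse a parse result to Maybe so the two variants are easy to compare.
run :: Parser a -> String -> Maybe a
run p = either (const Nothing) Just . parse (fully p) ""

main :: IO ()
main = do
  print (run atomic "never  ")  -- Just "never"
  print (run leaky  "never  ")  -- Nothing: "ne" was consumed before failing
```

Without `try`, Parsec's `string` consumes the shared prefix before failing, so the second alternative is never attempted.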


Pattern 2bi: Tokenizing Combinators. Annotate termi- 
nals with a token combinator, built on lexeme, to atomi- 
cally parse them with whitespace consumed. 


Using the token combinator, the primitive tokens can be 
wrapped up and the whitespace problems fixed. As a re- 
minder, here is the parser up to this point: 


alpha    = oneOf (['a'..'z'] ++ ['A'..'Z'])
digit    = oneOf ['0'..'9']
alphaNum = alpha <|> digit
number   = somel addDigit 0 digit
ident    = alpha <:> many alphaNum

expr = precedence $
  sops InfixL [Add <$ char '+', Sub <$ char '-'] +<
  sops InfixL [Mul <$ char '*']                  +<
  sops Prefix [Neg <$ string "negate"]           +<
  Atom atom

atom = char '(' *> (Parens <$> expr) <* char ')'
   <|> Num <$> number <|> Var <$> ident


The question is which of these are tokens, and which of them 
are not. This is largely subjective, but here alpha, digit, and 
alphaNum are treated as building blocks whereas number, 
ident, parentheses, and the operators are tokens. 


number = token (somel addDigit 0 digit)
ident  = token (alpha <:> many alphaNum)

expr = precedence $
  sops InfixL [ Add <$ token (char '+')
              , Sub <$ token (char '-') ]        +<
  sops InfixL [ Mul <$ token (char '*') ]        +<
  sops Prefix [ Neg <$ token (string "negate") ] +<
  Atom atom

atom = Parens <$>
         (token (char '(') *> expr <* token (char ')'))
   <|> Num <$> number <|> Var <$> ident


By ensuring all of the terminals of the grammar have been 
marked with token, no other whitespace handling needs 
to be performed for the expr in the parentheses. This has 
addressed the original problem so that parse expr "x + 7" 
now returns a correct successful result. 


Token Validation. The previous parser has correctly han-
dled each of the tokens within the grammar. However, using
string for "negate" is inappropriate: it fails to enforce any
separation between tokens, so an input such as negatex is
read as the keyword negate followed by the identifier x. A
better approach is to develop a distinct method for handling
keywords.


keyword :: String -> Parser ()
keyword k = token (string k *> notFollowedBy alphaNum)

keys :: [String]
keys = ["negate"]

The keyword combinator is simple: it will parse the given
string, but then ensure that it is not followed by another
valid identifier character. Care must be taken to consume
whitespace and make it atomic after the validation.


Pattern 2bii: Keyword Combinators. Avoid using string 
with token for keywords. Use a keyword combinator that 
enforces that the keyword does not form a valid prefix of 
another token. 


By replacing token . string with keyword to handle the
negate operator, parse expr "negatex" correctly returns
Right (Var "negatex"). A similar system can be devised for
longest-match operators, but there is no potential ambiguity 
with operators in the example parser. 
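The behaviour of the keyword combinator can be checked in isolation. The following is a minimal sketch in Parsec; the `Tok` type and `lexAll` are hypothetical helpers introduced only to make the result visible, and `ident` here does not reject reserved words as a full lexer would.

```haskell
import Text.Parsec (alphaNum, eof, letter, many, notFollowedBy, parse, spaces, string, try, (<|>))
import Text.Parsec.String (Parser)

-- Atomic parser followed by trailing whitespace.
token :: Parser a -> Parser a
token p = try p <* spaces

-- Validation sits inside `token`, so a failed check rewinds the whole keyword.
keyword :: String -> Parser ()
keyword k = token (string k *> notFollowedBy alphaNum)

-- Identifiers: a letter followed by letters or digits.
ident :: Parser String
ident = token ((:) <$> letter <*> many alphaNum)

-- Hypothetical token type, used only to make the outcome visible.
data Tok = Kw String | Ident String deriving (Eq, Show)

tok :: Parser Tok
tok = Kw "negate" <$ keyword "negate" <|> Ident <$> ident

lexAll :: String -> Maybe [Tok]
lexAll = either (const Nothing) Just . parse (spaces *> many tok <* eof) ""

main :: IO ()
main = print (lexAll "negate negatex")
-- Just [Kw "negate",Ident "negatex"]
```

Because the `notFollowedBy` check is inside the `try`, the rejected keyword consumes nothing and the identifier alternative sees the whole word.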


3.3 Using OverloadedStrings as a Facade 


While Section 3.1 provided a rationale for how to robustly 
handle whitespace in a grammar, and Section 3.2 neatly en- 
capsulated the combination of whitespace logic with other 
properties of the tokens, the solutions can justifiably be ac- 
cused of polluting the parser. Indeed, a suggested property 
of whitespace parsing was that it should not be intrusive 
to the main body of parser, but with the current setup, the 
parser is littered with tokens and keywords. 

The Haskell OverloadedStrings extension allows the 
regular Haskell syntax for string literals to represent some- 
thing else entirely. In other words, if there is an instance of 
the IsString type class available for a type s, string literals 
can represent values of type s implicitly. 


{-# LANGUAGE OverloadedStrings #-}
class IsString s where fromString :: String -> s


Pattern 2c: Overloaded Strings. Hide tokenizing logic 
by allowing string literals to serve as parsers. 


The IsString type class is of particular interest where lex- 
ing is concerned, with a valuable instance being one for 
Parser () (to help GHC’s constraint solver, u~() is used): 


instance u ~ () => IsString (Parser u) where
  fromString str
    | elem str keys = keyword str
    | otherwise     = () <$ token (string str)

This magical instance encapsulates both the use of token
and the use of keyword in the parser. Except for the tokens 
number and ident themselves (which should be separated), 
the rest of the parser can be stripped of its lexing baggage: 


expr = precedence $
  sops InfixL [Add <$ "+", Sub <$ "-"] +<
  sops InfixL [Mul <$ "*"]             +<
  sops Prefix [Neg <$ "negate"]        +<
  Atom atom

atom = "(" *> (Parens <$> expr) <* ")"
   <|> Num <$> number <|> Var <$> ident
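The same facade can be sketched for Parsec's `Parser` type. This is an assumption-laden illustration: the orphan instance and the extra language extensions are needed only because Parsec's `Parser` is a type synonym defined elsewhere, and `nested` is a hypothetical toy grammar that counts the parentheses around the keyword.

```haskell
{-# LANGUAGE FlexibleInstances, OverloadedStrings, TypeFamilies, TypeOperators #-}

import Data.String (IsString (..))
import Text.Parsec (alphaNum, eof, notFollowedBy, parse, spaces, string, try, (<|>))
import Text.Parsec.String (Parser)

token :: Parser a -> Parser a
token p = try p <* spaces

keyword :: String -> Parser ()
keyword k = token (string k *> notFollowedBy alphaNum)

keys :: [String]
keys = ["negate"]

-- String literals used at type Parser () become tokens or keywords.
instance u ~ () => IsString (Parser u) where
  fromString str
    | str `elem` keys = keyword str
    | otherwise       = () <$ token (string str)

-- Hypothetical toy grammar: count the parentheses wrapped around "negate".
nested :: Parser Int
nested = (+ 1) <$> ("(" *> nested <* ")") <|> 0 <$ "negate"

run :: String -> Maybe Int
run = either (const Nothing) Just . parse (spaces *> nested <* eof) ""

main :: IO ()
main = print (run "( ( negate ) )")
-- Just 2
```

The equality constraint lets GHC pick the instance before it knows the result type of a literal, which is what allows `"(" *> nested` to type-check without annotations.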


Discussion. The final result of the lexing transformation
is striking: it ends up eliminating noise that was present
even before lexing was incorporated, as the original char
combinators have been removed. This keeps the parser closer
in form to the original grammar, more so for non-precedence
grammars.
Preferably, the combinators and non-string tokens them- 
selves should be kept in another module. 

In practice, parser combinator libraries often support some
form of “lexer combinator generator”, where a specification
of the language’s tokens is given and out pop the combinators
to parse them. This mechanism, where it exists, will
also handle whitespace just as described in Section 3.1, but 
there is still value in understanding the justification behind 
the canonical implementations. 


4 Abstracting AST Construction


A common realisation after writing a working parser is that
the AST produced during parsing may need to be augmented
with extra information from the parse: this can be any infor-
mation that might be required for a later part of the overall
pipeline, often the line and column numbers. This can be a
frustrating realisation, as the parser and the data type need
to be modified to accommodate the new requirements, and
the bookkeeping required to patch the parser is intrusive,
undoing all the work of the other patterns!


Problem 3: Bookkeeping. ASTs built during parsing oc-
casionally require parser metadata.


To demonstrate both the problem and a taste of the dam- 
age done, the grammar will be augmented to now include 
assignments and statements. For the sake of illustration, the 
requirement is that variables and assignments require po- 
sition information, so that, at a later point in the program, 
scoping errors can be reported referencing the original posi- 
tions of the offenders. The new rules in the grammar are: 


<stmt> ::= <asgn> ';' <stmt> | <asgn>
<asgn> ::= <ident> ':=' <expr>

In the spirit of the type safety adopted in Section 2, the 
new datatypes will mirror the structure of these rules: 

data Stmt = Seq Asgn Stmt | OfAsgn Asgn
data Asgn = Asgn String Expr (Int, Int) 
data Atom = Num Int | Var String (Int, Int) | Parens Expr 


As discussed, the Asgn constructor has been given an extra 
argument to accommodate the extra position information 
required for scope errors. The same has been done for the 
Var constructor in Atom. The implementation of the parser 
is familiar, making use of infixr1 to handle statements: 


stmt = infixr1 OfAsgn asgn (Seq <$ ";")
asgn = pos <**> (Asgn <$> ident <* ":=" <*> expr)
atom = "(" *> (Parens <$> expr) <* ")"
   <|> Num <$> number <|> pos <**> (Var <$> ident)


The atom parser is the only parser that needs to change to 
accommodate the new requirements. The new intrusion into 
the parser is the pos :: Parser (Int, Int) combinator, which 
yields the required information. Perhaps interestingly, the 
combinator is merged in a somewhat counter-intuitive order- 
ing, applied on the left to the rest of the partial constructor. 
This is because the position has been placed at the end of 
the constructor, but positions should be obtained before any 
tokens have been read otherwise they will point at the next 
token after the variable or assignment. This serves as further 
justification of the principle that only trailing whitespace 
should be consumed. This has already introduced noise into
the parser; if position information were required for the op-
erators as well, the problem would get much worse.


Anti-pattern 3: Inline Bookkeeping. Incorporating meta- 
data inline into the parser is intrusive and brittle. 


4.1 Smart Constructors for Parsers 


Smart constructors are a well-known Haskell technique for 
augmenting a datatype with additional builder functions to: 


• Define compound structures built of core constructors
• Perform light-weight validation on constructor inputs to
  guarantee invariants hold true
• Perform light-weight optimisations on constructors that
  compose their arguments
• Simplify the API by providing default values
• Process constructor arguments into normal forms


They are simply regular Haskell functions, often with a
similar name to the constructor they are abstracting and the
prefix mk (or make). In general, given a regular constructor
C :: A1 -> ... -> An -> T, a smart constructor typically has
the form mkC :: B1 -> ... -> Bm -> T where the required
arguments A1 ... An are synthesised from other values B1 ... Bm
and used to instantiate C to make a T. A simple but effective
concept, they appeared in Section 2.3: the smart constructors
ops and sops both helped to simplify the API by providing
default wrappers; and gops extended the Op constructor by
combining a list of parsers with choice.

When parsing, it is convenient to have a lifted smart con-
structor of shape mkC :: Parser A1 -> ... -> Parser An ->
Parser T, since the value C is built during parsing. When the
arguments are plainly parsed then combined in sequence, the
constructor can be implemented by liftAn C, which abstracts
away the <$> and <*> normally required to combine results.
However, smart constructors can also be made to perform
any of three common tasks:


• Perform bookkeeping as the AST is built
• Perform semantic validation on the AST nodes
• Perform normalisation on the AST nodes


All of these are properties of the underlying constructor, and 
may or may not be required. An example of bookkeeping is
extracting position information (this is possible since the
smart constructor operates in the parsing effect); an example
of semantic validation is ensuring that integers do not
overflow; and an example of normalisation is arbitrating
between two ambiguous constructions without backtracking.
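As an illustration of the validation task, the overflow check mentioned above might look as follows in Parsec. This is a sketch under assumptions: the raw token is read as an `Integer`, the bound is that of `Int`, and the error text is made up for the example.

```haskell
import Text.Parsec (digit, eof, many1, parse, spaces)
import Text.Parsec.String (Parser)

newtype Atom = Num Int deriving (Eq, Show)

-- Raw token: an arbitrarily large natural number.
number :: Parser Integer
number = read <$> many1 digit <* spaces

-- Lifted smart constructor that validates before building the node.
mkNum :: Parser Integer -> Parser Atom
mkNum p = do
  n <- p
  if n > toInteger (maxBound :: Int)
    then fail "integer literal is out of range"
    else pure (Num (fromInteger n))

-- The grammar rule is unaware that any validation takes place.
atom :: Parser Atom
atom = mkNum number

run :: String -> Maybe Atom
run = either (const Nothing) Just . parse (atom <* eof) ""

main :: IO ()
main = do
  print (run "42")                       -- Just (Num 42)
  print (run "99999999999999999999999")  -- Nothing
```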


Pattern 3a: Lifted Constructors. Use smart constructors 
to decouple bookkeeping logic from the parser. 


The first of the three advertised properties above can im- 
mediately address the issues with the current parser. The 
smart constructors for Asgn and Var should both handle po- 
sition tracking, but the others, for instance, Num, need do 
nothing but lift the underlying constructor: 


mkAsgn :: Parser String -> Parser Expr -> Parser Asgn
mkAsgn var body = pos <**> (Asgn <$> var <*> body)

mkVar var  = pos <**> (Var <$> var)
mkNum n    = Num <$> n
mkParens x = Parens <$> x

asgn = mkAsgn (ident <* ":=") expr
atom = "(" *> mkParens expr <* ")"
   <|> mkNum number <|> mkVar ident


The parsers now have been simplified further: the work 
needed to combine various results has been abstracted into 
the smart constructors, and the parser does not need to be 
aware of what additional work needs to be performed on 
these sub-parts. The advantage to this approach is that if
the position information were no longer needed, it could be
removed without changing the parser: this adheres nicely to
the Single-Responsibility Principle of Software Engineering [4,
20, 23] as the parser only cares about building results, not
how they are built, which is the sole job of the constructors.

The Lifted Constructors pattern works well for linear segments
of a parser, where the arguments to constructors are
situated next to the application of the constructor itself. But 
this is not always the case: consider the constructors for the 
operators in the precedence table. In these instances, the 
same principle can be applied, but in a different shape. 
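A lifted constructor with position bookkeeping can be sketched in Parsec as below. The `pos` combinator is assumed to be definable from Parsec's `getPosition`, and the `Asgn` node is simplified so that its right-hand side is just another identifier.

```haskell
import Control.Applicative ((<**>))
import Text.Parsec (alphaNum, getPosition, letter, many, parse, sourceColumn, sourceLine, spaces, string, try)
import Text.Parsec.String (Parser)

type Pos = (Int, Int)

-- Simplified node: the right-hand side is just another identifier here.
data Asgn = Asgn String String Pos deriving (Eq, Show)

token :: Parser a -> Parser a
token p = try p <* spaces

ident :: Parser String
ident = token ((:) <$> letter <*> many alphaNum)

-- Current (line, column); consumes no input.
pos :: Parser Pos
pos = (\p -> (sourceLine p, sourceColumn p)) <$> getPosition

-- The position is read first but supplied as the last constructor argument.
mkAsgn :: Parser String -> Parser String -> Parser Asgn
mkAsgn var body = pos <**> (Asgn <$> var <*> body)

-- The grammar rule itself never mentions positions.
asgn :: Parser Asgn
asgn = mkAsgn (ident <* token (string ":=")) ident

run :: String -> Maybe Asgn
run = either (const Nothing) Just . parse (spaces *> asgn) ""

main :: IO ()
main = print (run "  x := y")
-- Just (Asgn "x" "y" (1,3))
```

The reported column is that of the variable itself because leading whitespace has already been consumed by the time `pos` runs.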


Pattern 3b: Deferred Constructors. Defer the construc-
tion of an AST node to abstract the bookkeeping when its
arguments are not immediately available.


Here constructors are returned by parsers so that their 
arguments can be applied to them later but without any re- 
quired metadata in the type. In the basic case, the constructor 
can be returned with pure: 


mkAdd :: Parser (Expr -> Term -> Expr)
mkAdd = pure Add

mkNeg :: Parser (Negate -> Negate)
mkNeg = pure Neg


While it might appear like this is not particularly useful, it 
still means that if position information needs to be added to 
a node, the parser does not need modification, for example: 


mkMul :: Parser (Term -> Negate -> Term)
mkMul = (\p x y -> Mul x y p) <$> pos


In this case, the position can be read immediately and applied 
ahead of time, with the other arguments deferred to later. 
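A deferred constructor can be exercised with Parsec's `chainl1`, which here stands in for the paper's precedence table. This is a sketch with a simplified, homogeneous `Term` type; `pos` is again assumed to be built from `getPosition`.

```haskell
import Text.Parsec (chainl1, char, digit, getPosition, many1, parse, sourceColumn, sourceLine, spaces, try)
import Text.Parsec.String (Parser)

type Pos = (Int, Int)

data Term = Mul Term Term Pos | Lit Int deriving (Eq, Show)

token :: Parser a -> Parser a
token p = try p <* spaces

pos :: Parser Pos
pos = (\p -> (sourceLine p, sourceColumn p)) <$> getPosition

lit :: Parser Term
lit = Lit . read <$> token (many1 digit)

-- Deferred constructor: the position is captured now, operands arrive later.
mkMul :: Parser (Term -> Term -> Term)
mkMul = (\p x y -> Mul x y p) <$> pos

-- `chainl1` stands in for the paper's precedence table.
term :: Parser Term
term = chainl1 lit (mkMul <* token (char '*'))

run :: String -> Maybe Term
run = either (const Nothing) Just . parse term ""

main :: IO ()
main = print (run "2 * 3 * 4")
-- Just (Mul (Mul (Lit 2) (Lit 3) (1,3)) (Lit 4) (1,7))
```

Each `Mul` node records the column of its own operator, yet the `term` rule never mentions positions.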
The remainder of the parser is modified as follows: 


expr = precedence $
  sops InfixL [mkAdd <* "+", mkSub <* "-"] +<
  sops InfixL [mkMul <* "*"]               +<
  sops Prefix [mkNeg <* "negate"]          +<
  Atom atom


The intrusion here is minimal, but the flexibility increased: 
this is more robust to change than any previous version. 


Discussion. The smart constructor pattern is ultimately a 
simple one, but very flexible. It allows the parser maintainer 
to improve the separability of their code and removes yet 
more combinator noise to bring the parser closer still to the 
original look of the grammar. 

It can be improved further, however: in other languages, 
for instance, the constructor can easily be overloaded so 
that the name does not require any prefix attached; and, 
in Haskell, the mechanical nature of the position tracking 
variant can be leveraged by Template Haskell to automatically
generate the smart constructors for a datatype. Pattern
synonyms [25] can be used as an alternative to smart constructors
to construct compound structures; however, the
result must be pattern matchable.


5 Improving Errors: Anticipating Mistakes 


An important part of a parser (up till now overlooked by this 
paper) is generating meaningful and helpful error messages. 
Writing good error messages is incredibly subjective as an 
exercise and so the focus here is not so much on what makes 
good error messages but instead on providing tools that the 
parser writer can keep in mind when considering their errors. 
Many libraries have a few combinators in common for errors
that will be leveraged here: 


(<?>)         :: Parser a -> String -> Parser a   -- infix 0
unexpected    :: String -> Parser a
fail          :: String -> Parser a
lookAhead     :: Parser a -> Parser a
notFollowedBy :: Parser a -> Parser ()


The <?> combinator, pronounced “label”, assigns a name to a
parser to identify it in an error message. As an example:


digit = oneOf ['0'..'9'] <?> "digit"


As an aside, adding whitespace is hardly ever the solution to 
a syntax error (with the exception of indentation-sensitive 
grammars), so ideally it should be hidden from any syntax 
errors to reduce noise. This can often be accomplished by 
using (<?>"") to hide the label of the whitespace. 

The unexpected and fail combinators both fail immedi- 
ately, as indicated by their ability to seemingly produce a 
value of any type a out of thin air if they were to succeed. 
Normally, unexpected changes the part of the error message 
representing the problematic token, and fail adds a bespoke 
message to the error⁵. The specific combinators required for
each technique may differ depending on the library. 

The lookAhead and notFollowedBy combinators are pos- 
itive and negative look-ahead respectively. If lookAhead 
succeeds, it does so without consuming any input, and if 
notFollowedBy’s argument fails, then the combinator suc- 
ceeds, and vice-versa; notFollowedBy never consumes input. 
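The two look-ahead combinators can be tried out directly in Parsec, whose versions share the types given above (its `notFollowedBy` additionally requires a `Show` instance for the result). The parsers `peekWord` and `ifKw` are hypothetical examples.

```haskell
import Text.Parsec (alphaNum, anyChar, letter, lookAhead, many1, notFollowedBy, parse, string, try, (<?>))
import Text.Parsec.String (Parser)

-- Positive look-ahead: peek at the next character, then parse the word
-- from the same starting point.
peekWord :: Parser (Char, String)
peekWord = (,) <$> lookAhead anyChar <*> (many1 letter <?> "word")

-- Negative look-ahead: "if" counts only when no identifier character follows.
ifKw :: Parser String
ifKw = try (string "if" <* notFollowedBy alphaNum) <?> "if"

run :: Parser a -> String -> Maybe a
run p = either (const Nothing) Just . parse p ""

main :: IO ()
main = do
  print (run peekWord "abc")  -- Just ('a',"abc")
  print (run ifKw "if x")     -- Just "if"
  print (run ifKw "iffy")     -- Nothing
```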


Conditionals. To provide a basis for the upcoming discus- 
sion, the grammar is extended one final time with conditional 
statements and basic comparisons: 


<stmts>  ::= <stmt> ';' <stmts> | <stmt>
<stmt>   ::= <asgn> | <ifStmt> | 'skip'
<ifStmt> ::= 'if' <comp> '{' <stmt> '}' 'else' '{' <stmt> '}'
<comp>   ::= <expr> '<' <expr>


Since there is more than one type of statement now, the 
sequence in (stmt) has been split out into a (stmts) rule 
instead. This change, along with the new components, must 
be incorporated into the AST:


data Stmts = Seq Stmt Stmts | OfStmt Stmt
data Stmt  = Asgn String Expr (Int, Int)
           | If Comp Stmt Stmt | Skip
data Comp  = Less Expr Expr


It would perhaps be overkill to create a separate If datatype
when a constructor of Stmt will do; other than that, the datatypes
align perfectly with the grammar. The parser makes use of the
same techniques used so far in the paper⁶:


stmts  = infixr1 OfStmt stmt (mkSeq <* ";")
stmt   = asgn <|> ifStmt <|> mkSkip <* "skip"
ifStmt = mkIf ("if" *> comp) ("{" *> stmt <* "}")
              ("else" *> ("{" *> stmt <* "}"))
comp   = mkLess expr ("<" *> expr)

⁵It may or may not also remove other information from the error.
⁶Assume that the additional keywords have been added to the lexer and
new smart constructors have been created.


Happily, this new syntactic extension offers some non- 
conventional syntax ripe for exploration. One example is 
that the body of an if statement cannot be empty: 


parse (fully stmts)
  "if 0 < 1 {} else { x := 10 }"
(1, 11): unexpected "}"
  expected identifier, if, skip
  > if 0 < 1 {} else { x := 10 }


This error could be improved by marking it using <?> to 
give the three alternatives the name “statement”. This is 
fine for experts, but it would be nice to explain what that 
means for those not versed in such terminology. This can be 
achieved by adding a fail combinator: 


stmt = (asgn <|> ifStmt <|> mkSkip <* "skip"
          <?> "statement")
   <|> fail "statements consist of skips, [...]"


This produces a more friendly error: 


(1, 11): unexpected "}"
  expected statement
  statements consist of skips, [...]
  > if 0 < 1 {} else { x := 10 }
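The combination of a label and a bespoke message can be reproduced in Parsec, although the rendering of its errors differs from the output shown above. In this sketch the statement forms (`skip` and a hypothetical `halt`) and the helper `messagesOf` are assumptions made for illustration.

```haskell
import Text.Parsec (parse, spaces, string, try, (<?>), (<|>))
import Text.Parsec.Error (errorMessages, messageString)
import Text.Parsec.String (Parser)

-- Atomic symbol with trailing whitespace.
sym :: String -> Parser ()
sym s = () <$ try (string s) <* spaces

-- `Halt` is a hypothetical second statement form.
data Stmt = Skip | Halt deriving (Eq, Show)

-- The label names the alternatives; `fail` attaches an explanation.
stmt :: Parser Stmt
stmt = (Skip <$ sym "skip" <|> Halt <$ sym "halt" <?> "statement")
   <|> fail "statements consist of skip or halt"

-- Collect the text of every message carried by a failed parse.
messagesOf :: Parser a -> String -> [String]
messagesOf p input = case parse p "" input of
  Left err -> map messageString (errorMessages err)
  Right _  -> []

main :: IO ()
main = do
  print (parse stmt "" "}")             -- the rendered error
  mapM_ putStrLn (messagesOf stmt "}")  -- its individual components
```

Both the label and the explanation survive because the two alternatives fail at the same position without consuming input.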


As another example, unlike many mainstream languages, 
if statements not only require an else branch, but a trailing 
semicolon too: 


parse (fully stmts)
  "if 0 < 1 { skip } else { skip }\nskip"
(2, 1): unexpected "s"
  expected ";", end of file
  > skip
    ^


Here, the user has fallen afoul of this restriction, assuming 
that a semicolon is not needed after braces. Unfortunately, 
the error message gives no indication that this is the cause 
of the issue, other than suggesting adding a semicolon. 


Problem 4: Helpful Errors. The parser may wish to pro- 
vide guidance about avoiding common mistakes. 


This example will be revisited later. Instead, suppose that 
the user now realises a semicolon is necessary, and now 
incorrectly assumes that all braces require a semicolon after: 

(1, 18): unexpected ";"
  expected else
  > if 0 < 1 { skip }; else { skip }


Again, it is clear that writing an else instead of the semi-
colon is the way to fix the problem, but the message fails to
explain why in a useful way. The user has fallen afoul
of the choice to make semicolons an operator instead of a 
delimiter, and again, there is no indication in the error mes- 
sage that would help the user understand what the rules are 
and why they have gotten them wrong. 

A first attempt may use the same strategy used with state- 
ments, by adding a bespoke fail firing if else is not parsed: 


elseClause = "else" <|> fail "semicolons are not [...]"


Whilst this seems fine at first glance (and works for the “true- 
positive” input above), it generates nonsensical messages in 
the presence of “false-positive” instances: 


(1, 18): unexpected end of input
  expected else
  semicolons are not allowed between if and else
  > if 0 < 1 { skip }
                     ^


Anti-pattern 4: Unconditional Errors. Addressing a 
common issue with a fixed error message can produce 
misleading errors. 


5.1 Using Positive Lookahead 


The problem with the naive approach of using fail on its own
is that it has no awareness of the surrounding input:
clearly, reporting that semicolons are illegal when found 
between if and else is only valid when there actually is 
a semicolon between them. The previous attempt failed to 
respect that idea, instead always emitting the message. 


Pattern 4a: Verified Errors. Use lookAhead to verify that 
contextual obligations are met before raising an error. 


In contrast, lookAhead can be used to inspect the next part 
of the input to determine whether the conditions under which
the error makes sense are met. In the previous scenario,
it is important to ensure that a semicolon has been written
before referencing it, so the fail will be guarded. To help
ensure that the added error widgets are not seen as part of 
the grammar, prefix their name with an underscore. 


elseClause = "else" <|> _semi
_semi = lookAhead (";" *> "else"
          *> fail "semicolons are not [...]" <?> "")


By predicating the fail behind the parsing of a semicolon and 
an else the message only appears when appropriate. The 
widget should be given the “hidden” error label. 
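One way to realise the widget in Parsec is sketched below. The placement of `try` and of the hidden label is an adaptation to Parsec and may differ from the paper's library; `ifTail` is a hypothetical fragment standing for the end of the first branch, and `messagesOf` is a helper for inspecting the error.

```haskell
import Text.Parsec (alphaNum, lookAhead, notFollowedBy, parse, spaces, string, try, (<?>), (<|>))
import Text.Parsec.Error (errorMessages, messageString)
import Text.Parsec.String (Parser)

token :: Parser a -> Parser a
token p = try p <* spaces

sym :: String -> Parser ()
sym s = () <$ token (string s)

keyword :: String -> Parser ()
keyword k = token (string k *> notFollowedBy alphaNum)

semiMsg :: String
semiMsg = "semicolons are not allowed between if and else"

-- Error widget: fires only when a semicolon really does precede `else`.
_semi :: Parser a
_semi = (lookAhead (try (sym ";" *> keyword "else")) *> fail semiMsg) <?> ""

elseClause :: Parser ()
elseClause = keyword "else" <|> _semi

-- Hypothetical fragment: the closing brace of the first branch, then `else`.
ifTail :: Parser ()
ifTail = sym "}" *> elseClause

-- Collect the text of every message carried by a failed parse.
messagesOf :: Parser a -> String -> [String]
messagesOf p input = case parse p "" input of
  Left err -> map messageString (errorMessages err)
  Right _  -> []

main :: IO ()
main = do
  print (semiMsg `elem` messagesOf ifTail "}; else")  -- True
  print (semiMsg `elem` messagesOf ifTail "}")        -- False
```

The second case is the false positive from the unconditional version: with the guard in place, a missing `else` no longer mentions semicolons.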


5.2 Using Negative Lookahead


While the Verified Errors pattern is useful, it is less appropri- 
ate when the context of the error depends on non-local infor- 
mation in the grammar or when there are multiple places it 
can occur. As an example, informing the user that assigning
comparisons to variables is illegal cannot be performed with
lookAhead, since the valid continuations of an assignment (end
of input, semicolons, or closing braces) are also valid elsewhere.
If a lookAhead were placed at each of these places, then bad
input like "skip <" may be reported as a bad assignment and,
even if it worked properly, it would duplicate the logic around
the parser. Placing a lookAhead in the assignment, on the
other hand, would only work with lookAhead comp ... <|> expr,
which always parses valid input twice, so it is not ideal.


Pattern 4b: Preventative Errors. Use notFollowedBy to 
rule out illegal input or else raise an error. 


A reasonable substitute is to use notFollowedBy to ensure 
that the right-hand side of an assignment is an expression 
that does not form part of an otherwise valid comparison. 
As with the Verified Errors pattern, the error widget can be 
distinguished by prepending its name with _no. 


_noComp =
  notFollowedBy ("<" *> expr)
  <|> unexpected "<"
  <|> fail "\"<\" cannot be used in assignments"
  <?> "end of assignment"
asgn = mkAsgn (ident <* ":=") expr <* _noComp


This widget ensures that only when a < appears after an as- 
signment with another expression will the user be informed 
that comparisons are illegal in assignments. 
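The same idea can be approximated in Parsec. Since the formulation above relies on how a library merges the errors of alternatives, this sketch does not assume any particular merging behaviour and instead raises the message directly through a hypothetical helper, `forbid`. The expression grammar is reduced to integer literals for brevity.

```haskell
import Text.Parsec (alphaNum, digit, eof, letter, lookAhead, many, many1, optionMaybe, parse, spaces, string, try)
import Text.Parsec.Error (errorMessages, messageString)
import Text.Parsec.String (Parser)

token :: Parser a -> Parser a
token p = try p <* spaces

sym :: String -> Parser ()
sym s = () <$ token (string s)

ident :: Parser String
ident = token ((:) <$> letter <*> many alphaNum)

-- Simplified expressions: integer literals only.
expr :: Parser Int
expr = token (read <$> many1 digit)

-- Hypothetical helper: succeed without consuming input unless `bad` matches
-- at this point, in which case fail with the bespoke message.
forbid :: Parser a -> String -> Parser ()
forbid bad msg = do
  hit <- optionMaybe (lookAhead (try bad))
  case hit of
    Nothing -> pure ()
    Just _  -> fail msg

compMsg :: String
compMsg = "\"<\" cannot be used in assignments"

_noComp :: Parser ()
_noComp = forbid (sym "<" *> expr) compMsg

asgn :: Parser (String, Int)
asgn = (,) <$> (ident <* sym ":=") <*> expr <* _noComp

run :: String -> Maybe (String, Int)
run = either (const Nothing) Just . parse (asgn <* eof) ""

messagesFor :: String -> [String]
messagesFor input = case parse (asgn <* eof) "" input of
  Left err -> map messageString (errorMessages err)
  Right _  -> []

main :: IO ()
main = do
  print (run "x := 1")                             -- Just ("x",1)
  print (compMsg `elem` messagesFor "x := 1 < 2")  -- True
```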

This technique was applied in Section 3.2 to prevent key- 
words from being followed by a letter. It can also be used 
after a closing brace to guard against another statement being 
written before a semicolon, resolving the very first exam- 
ple. Finally, it can also report that non-associative operators 
cannot be chained in the precedence combinator. 


Discussion. These final two patterns are more situational 
than the rest presented in this paper but no less useful. How- 
ever, the formulation using lookAhead and notFollowedBy 
is not perfect, and depends on how a library arbitrates be- 
tween different messages. But they are a good heuristic for 
generating more bespoke, and applicable, errors. 


6 Related Work 


Object-oriented design patterns [9] are ubiquitous in the
OOP world, being widely applied and embraced. There are
instances of their use for the design and creation of parsers
themselves: Nguyen et al. [22] use several of the classic
OOP design patterns to create easily extensible LL(1) parsers,
where the elements of the grammar are decoupled to isolate
the scope of any changes; Schreiner and Heliotis [26] lever-
age many of the same patterns in the internal design of a
parser generator tool called oops3.

The ANTLR parser generator tool [24] uses the Visitor pat- 
tern to separate the building of a semantic action from pars- 
ing by exposing a visitor to process a generic parse tree; 
this contrasts with our Lifted Constructors pattern to decouple
the concrete construction of ASTs from the parser which
generates them. The difference in approach is a form of defor- 
estation where the parser itself acts as a fold over the parse 
tree. ANTLR also uses the Observer pattern in the form of lis- 
teners to allow bespoke errors to be generated depending on 
the surrounding context, contrasting with our parser-driven 
Verified Errors and Preventative Errors patterns. 

While there is an abundance of parser combinator tutori- 
als [7, 15, 27, 28], they focus on how to design the libraries 
themselves, and do not offer any substantial discussion on 
reusable patterns for writing parsers. This is surprising since 
design patterns find such popularity in other disciplines. 
Kövesdán et al. [18] do document specific design patterns in
the parsing space, but these are directed at an architectural 
level, describing the differences and applications of tech- 
niques such as hand-rolled parsers, parser generators, and so 
on. In this context, they would describe parser combinators 
as a pattern in themselves, and not describe their underlying 
nuances. As far as we are aware, our patterns have not been 
presented in the true design pattern style before, instead 
existing in folklore. 

Different parsing algorithms such as Earley’s algorithm 
[6], LALR, and LR can all handle left-recursion without any 
changes to the grammar or the parser [1]. There are also 
examples of libraries that use memoisation to allow left- 
recursion [8, 16, 17]. Danielsson [2] presents a parser com- 
binator library using guarded co-induction to ensure that 
left-recursive grammars are still productive. Devriese and 
Piessens [5] present a parser combinator library that uses 
the left-corner transform [21] to automatically remove left- 
recursion. In all of these cases, the Chains pattern is not 
needed. However, variants of the Precedence Tables pattern 
still appear in many parser generator tools, like Happy [11]. 


7 Conclusion 


The final state of the grammar is shown in Figure 3, and 
the final state of the parser (with the AST, lexer, and smart
constructors omitted) is shown in Figure 4, along with the 
definitions of the error widgets that are used. 

The effect of the Precedence Table pattern is that the prece- 
dence and fixities of the expression portion of the grammar 
are clearly represented in the expr rule in a way that lever- 
ages strong types to ensure the correctness of the table’s 
fixities and ordering. While this does make the parser di- 
verge from the grammar, it provides an easy way to establish 
information about each operator. The Heterogeneous Chains
pattern has been used to handle the separation of statements
using semicolons: here, the right-associativity of sequencing
is enforced using the stronger types offered by infixr1. If the
grammar did not specify the associativity for us, however,
the Homogeneous Chains pattern could have been used to
provide a more flexible implementation.

<stmts>  ::= <stmt> ';' <stmts> | <stmt>
<stmt>   ::= <asgn> | <ifStmt> | 'skip'
<asgn>   ::= <ident> ':=' <expr>
<ifStmt> ::= 'if' <comp> '{' <stmt> '}'
             'else' '{' <stmt> '}'
<comp>   ::= <expr> '<' <expr>
<expr>   ::= <expr> '+' <term> | <expr> '-' <term> | <term>
<term>   ::= <term> '*' <negate> | <negate>
<negate> ::= 'negate' <negate> | <atom>
<atom>   ::= '(' <expr> ')'
           | <number> | <ident>

Figure 3. Final Grammar

Using the Overloaded Strings pattern, the parser is devoid
of any references to tokens or whitespace. This allows the
parser to adopt a form closer to that of the grammar by using 
string literals. Behind the scenes, the Whitespace, Tokeniz- 
ing, and Keyword Combinators patterns are used to cleanly 
manage lexing and consume whitespace consistently. 

The Lifted and Deferred Constructors patterns have been
employed to abstract the creation of the AST from the parser,
in the process abstracting away many sequencing combinators.
While parts of the AST this parser produces may require
position information, that is not evident here and can be 
easily managed separately. 

The Verified and Preventative Errors patterns have been 
employed in the form of _semi and _noComp to provide 
some bespoke errors to the user of the parser. Whilst they 
distract a little from the overall parser, they can be kept 
separate and distinguished with their naming convention. 

In all, the use of all of these patterns has yielded a clean 
and maintainable parser that can be easily extended as the 
language grows, and we have no doubt there are plenty more 
patterns waiting to be documented! 

stmts  = infixr1 OfStmt stmt (mkSeq <* ";")
stmt   = asgn <|> ifStmt <|> mkSkip <* "skip"
asgn   = mkAsgn (ident <* ":=") expr <* _noComp
ifStmt = mkIf ("if" *> comp) ("{" *> stmt <* "}")
              (("else" <|> _semi) *> "{" *> stmt <* "}")
comp   = mkLess expr ("<" *> expr)
expr   = precedence $
  sops InfixL [mkAdd <* "+", mkSub <* "-"] +<
  sops InfixL [mkMul <* "*"]               +<
  sops Prefix [mkNeg <* "negate"]          +<
  Atom atom
atom   = "(" *> mkParens expr <* ")"
     <|> mkNum number <|> mkVar ident

_semi   = lookAhead (";" *> "else"
            *> fail "semicolons are not [...]" <?> "")
_noComp =
  notFollowedBy ("<" *> expr)
  <|> unexpected "<"
  <|> fail "\"<\" cannot be used in assignments"
  <?> "end of assignment"

Figure 4. Final Parser